Lecture 21
bench

# A tibble: 4 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 d[d$x > 0.5, ] 75.8µs 90.8µs 10784. 234.56KB 42.0
2 d[which(d$x > 0.5), ] 78.2µs 94.8µs 10396. 268.33KB 78.2
3 subset(d, x > 0.5) 93.6µs 116.8µs 8267. 287.09KB 61.3
4 filter(d, x > 0.5) 249.4µs 285µs 3457. 1.47MB 25.8
# A tibble: 4 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 d[d$x > 0.5, ] 7.54ms 7.99ms 125. 13.3MB 54.2
2 d[which(d$x > 0.5), ] 9.11ms 9.35ms 107. 24.8MB 112.
3 subset(d, x > 0.5) 11.61ms 12.11ms 80.9 24.8MB 91.0
4 filter(d, x > 0.5) 9.5ms 10.37ms 94.8 24.8MB 105.
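Timings like the ones above can be produced with a sketch along these lines (the data frame `d` and its size are assumptions; the slides appear to use a smaller and a larger version of the same data):

```r
library(bench)

# Assumed data: a data frame with a numeric column x (size is a guess)
set.seed(1)
d = data.frame(x = runif(1e5), y = rnorm(1e5))

b = bench::mark(
  d[d$x > 0.5, ],
  d[which(d$x > 0.5), ],
  subset(d, x > 0.5),
  dplyr::filter(d, x > 0.5),
  check = FALSE  # results differ in row names, so skip equality checking
)
b
```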
bench - relative results

# A tibble: 4 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 d[d$x > 0.5, ] 1 1 1.55 1 1
2 d[which(d$x > 0.5), ] 1.21 1.17 1.32 1.86 2.06
3 subset(d, x > 0.5) 1.54 1.52 1 1.86 1.68
4 filter(d, x > 0.5) 1.26 1.30 1.17 1.86 1.94
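Relative results like these come from calling summary() on a bench_mark object with relative = TRUE, which divides each column by its minimum; a minimal sketch:

```r
library(bench)

d = data.frame(x = runif(1e4))
b = bench::mark(
  d[d$x > 0.5, ],
  subset(d, x > 0.5),
  check = FALSE
)

# Each numeric column is rescaled so its smallest value is exactly 1
summary(b, relative = TRUE)
```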
t.test

Imagine we have run 1000 experiments (rows), each of which collects data on 50 individuals (columns). The first 25 individuals in each experiment are assigned to group 1 and the rest to group 2.
The goal is to calculate the t-statistic for each experiment comparing group 1 to group 2.
# A tibble: 50 × 1,002
ind group exp1 exp2 exp3 exp4 exp5 exp6
<int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 2.83 7.90 8.39 10.5 11.5 11.5
2 2 1 14.5 8.59 11.8 9.75 9.34 4.43
3 3 1 8.74 6.79 7.33 8.43 7.66 11.5
4 4 1 13.1 5.51 6.39 9.67 8.55 16.3
5 5 1 4.51 13.1 13.4 4.57 2.89 10.4
6 6 1 11.3 5.18 12.2 5.59 10.7 12.2
7 7 1 9.65 8.53 5.32 11.8 11.5 7.63
8 8 1 13.2 12.6 8.05 16.2 12.9 10.5
9 9 1 7.83 15.0 10.2 11.1 16.2 13.2
10 10 1 12.5 11.9 9.07 8.59 9.87 12.6
# ℹ 40 more rows
# ℹ 994 more variables: exp7 <dbl>, exp8 <dbl>,
# exp9 <dbl>, exp10 <dbl>, exp11 <dbl>,
# exp12 <dbl>, exp13 <dbl>, exp14 <dbl>,
# exp15 <dbl>, exp16 <dbl>, exp17 <dbl>,
# exp18 <dbl>, exp19 <dbl>, exp20 <dbl>,
# exp21 <dbl>, exp22 <dbl>, exp23 <dbl>, …
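A tibble with the shape and group structure shown above can be generated with something like the following (the seed, mean, and sd are assumptions; only the dimensions and grouping match the slides):

```r
library(tibble)

set.seed(1234)  # assumed seed
n_ind = 50
n_exp = 1000

m = matrix(rnorm(n_ind * n_exp, mean = 10, sd = 3), ncol = n_exp)
colnames(m) = paste0("exp", seq_len(n_exp))

# First 25 individuals in group 1, the rest in group 2
X = as_tibble(cbind(
  data.frame(ind = seq_len(n_ind), group = rep(1:2, each = n_ind / 2)),
  m
))

dim(X)  # 50 rows, 1002 columns
```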
ttest_hand_calc = function(X) {
f = function(x, grp) {
t_stat = function(x) {
m = mean(x)
n = length(x)
var = sum((x - m) ^ 2) / (n - 1)
list(m = m, n = n, var = var)
}
g1 = t_stat(x[grp == 1])
g2 = t_stat(x[grp == 2])
se_total = sqrt(g1$var / g1$n + g2$var / g2$n)
(g1$m - g2$m) / se_total
}
apply(X[,-(1:2)], 2, f, X$group)
}
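The benchmark below also calls ttest_formula(), ttest_for(), and ttest_apply(), which are not shown here. A plausible ttest_apply() (an assumption, not the slides' exact code) just wraps t.test(), whose default Welch statistic matches the hand calculation above:

```r
# Assumed implementation: column-wise Welch t statistics via t.test()
ttest_apply = function(X) {
  apply(
    X[, -(1:2)], 2,
    function(x) t.test(x[X$group == 1], x[X$group == 2])$statistic
  )
}

# Demo on a tiny assumed data set
X_demo = data.frame(
  ind = 1:10, group = rep(1:2, each = 5),
  exp1 = rnorm(10), exp2 = rnorm(10)
)
ttest_apply(X_demo)
```

The overhead of t.test() per column is why the hand-calculated version wins by an order of magnitude.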
system.time(ttest_hand_calc(X))

   user  system elapsed
  0.010   0.000   0.011
bench::mark(
ttest_formula(X, m),
ttest_for(X, m),
ttest_apply(X),
ttest_hand_calc(X),
check=FALSE
)

Warning: Some expressions had a GC in every iteration; so filtering is disabled.
# A tibble: 4 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 ttest_formula(X, m) 137.04ms 139.3ms 7.14 8.24MB 14.3
2 ttest_for(X, m) 46.77ms 53.69ms 19.1 1.91MB 15.3
3 ttest_apply(X) 39.74ms 45.43ms 21.9 3.48MB 14.0
4 ttest_hand_calc(X) 6.35ms 6.66ms 127. 3.44MB 15.8
parallel

Part of R's base packages
Tools for forking R processes (some functions do not work on Windows)
Core functions:
detectCores
pvec
mclapply
mcparallel & mccollect
detectCores

Surprisingly, detects the number of cores of the current system.
Parallelization of a vectorized function call
user system elapsed
0.011 0.007 0.018
user system elapsed
0.211 0.876 1.005
user system elapsed
0.087 0.842 0.816
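A pvec() sketch; as the timings above suggest, the fork-and-copy overhead can easily outweigh the gain for cheap vectorized operations:

```r
library(parallel)

x = runif(1e6)

# pvec() splits x into chunks, applies the (vectorized) function to each
# chunk in a forked child process, and reassembles the result.
# Forking is unavailable on Windows, so fall back to a single core there.
cores = if (.Platform$OS.type == "windows") 1L else 2L
res = pvec(x, sqrt, mc.cores = cores)

all.equal(res, sqrt(x))
```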
bench::system_time

mclapply

Implements a parallelized version of lapply.
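A minimal mclapply() sketch (again fork-based, so it runs sequentially on Windows):

```r
library(parallel)

cores = if (.Platform$OS.type == "windows") 1L else 2L

# Drop-in replacement for lapply() that farms iterations out to
# forked child processes
res = mclapply(1:4, function(i) i^2, mc.cores = cores)
unlist(res)  # 1 4 9 16
```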
mcparallel & mccollect

mcparallel: asynchronous evaluation of an R expression in a separate process
mccollect: checks mcparallel objects for completion and collects their results
$`33198`
[1] 0.002961617
Warning in selectChildren(jobs, timeout): cannot wait for child 33198
as it does not exist
NULL
Warning in selectChildren(jobs, timeout): cannot wait for child 33198
as it does not exist
NULL
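A sketch of the mcparallel() / mccollect() workflow; the warnings above come from calling mccollect() a second time on a job whose child process has already exited (the expression being evaluated here is an assumption):

```r
library(parallel)

res = NULL
if (.Platform$OS.type != "windows") {  # forking is unavailable on Windows
  # Start evaluating the expression asynchronously in a forked child
  job = mcparallel(mean(rnorm(1e6)))

  # ... do other work here while the child runs ...

  # Block until the child finishes and fetch its result;
  # the returned list is named by the child's process id
  res = mccollect(job)

  # Collecting the same job again warns "cannot wait for child ... as it
  # does not exist" and returns NULL, as seen above
}
res
```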
doMC & foreach

Packages by Revolution Analytics that provide the foreach function, a parallelizable for loop (and then some).
Core functions:
registerDoMC
foreach, %dopar%, %do%
registerDoMC

Primarily used to set the number of cores used by foreach; by default uses options("cores") or half the number of cores found by detectCores from the parallel package.
foreach

A slightly more powerful version of base for loops (think for with an lapply flavor). Combined with %do% or %dopar% for single or multicore execution.
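A minimal foreach sketch; unlike a base for loop, it returns a value (a list by default):

```r
library(foreach)

# Evaluates the body once per value of i and returns the results
# as a list (a base for loop would return NULL)
res = foreach(i = 1:3) %do% sqrt(i)
str(res)
```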
foreach - iterators

foreach can iterate across more than one value, but it doesn't do length coercion.
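A sketch of joint iteration; note that the loop stops at the shorter argument rather than recycling:

```r
library(foreach)

# Iterates over a and b in lockstep; with lengths 5 and 2 the loop
# runs only twice -- no recycling / length coercion
res = foreach(a = 1:5, b = 11:12, .combine = c) %do% (a + b)
res  # 12 14
```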
foreach - combining results

foreach - parallelization

Swapping out %do% for %dopar% will use the parallel backend.
user system elapsed
0.124 0.035 0.080
user system elapsed
0.149 0.044 0.061
user system elapsed
0.186 0.050 0.050
user system elapsed
0.000 0.000 3.012
user system elapsed
0.032 0.008 3.074
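A sketch of swapping in the parallel backend (doMC is fork-based and not available on Windows; the core count here is an arbitrary choice):

```r
library(foreach)
library(doMC)  # fork-based backend; not available on Windows

registerDoMC(2)  # register 2 cores for %dopar%

# .combine = c flattens the per-iteration results into a vector;
# results come back in iteration order by default
res = foreach(i = 1:4, .combine = c) %dopar% {
  sqrt(i)
}
res
```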
Bootstrapping is a resampling scheme where the original data is repeatedly reconstructed by taking samples of size n (with replacement) from the original data, and each sample is used to repeat an analysis procedure of interest. Below is an example of fitting a local regression (loess) to some synthetic data; we will construct a bootstrap prediction interval for this model.
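A minimal sequential sketch of the described procedure (the synthetic data, grid, and number of resamples are assumptions; strictly speaking this yields a pointwise bootstrap band for the fitted curve):

```r
set.seed(1)

# Assumed synthetic data: a noisy sine curve
n = 250
d = data.frame(x = runif(n, 0, 2 * pi))
d$y = sin(d$x) + rnorm(n, sd = 0.3)

# Predict on an interior grid so every resampled fit covers it
x_grid = seq(quantile(d$x, 0.05), quantile(d$x, 0.95), length.out = 100)

# Resample rows with replacement, refit loess, predict on the grid
boot_pred = sapply(seq_len(500), function(i) {
  d_star = d[sample(n, n, replace = TRUE), ]
  predict(loess(y ~ x, data = d_star), newdata = data.frame(x = x_grid))
})

# Pointwise 95% band from the bootstrap predictions
band = apply(boot_pred, 1, quantile, probs = c(0.025, 0.975), na.rm = TRUE)
```

Each of the 500 refits is independent, which makes this an obvious candidate for mclapply or %dopar%.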
Optimal use of parallelization / multiple cores is hard, there isn’t one best solution
Don’t underestimate the overhead cost
Experimentation is key
Measure it or it didn’t happen
Be aware of the trade off between developer time and run time
An awful lot of statistics is at its core linear algebra.
For example:
\[ \hat{\beta} = (X^T X)^{-1} X^Ty \]
Principal component analysis
Find \(T = XW\) where \(W\) is a matrix whose columns are the eigenvectors of \(X^TX\).
Often solved via SVD - Let \(X = U\Sigma W^T\) then \(T = U\Sigma\).
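A small numeric check of the identity above on assumed random data: the scores from \(T = XW\) and from \(T = U\Sigma\) agree (columns are centered first, as PCA assumes):

```r
set.seed(1)
X = matrix(rnorm(100 * 5), ncol = 5)
Xc = scale(X, center = TRUE, scale = FALSE)  # center columns for PCA

s = svd(Xc)              # Xc = U %*% diag(d) %*% t(V)
T1 = Xc %*% s$v          # T = X W, with W = V (eigenvectors of X^T X)
T2 = s$u %*% diag(s$d)   # T = U Sigma

all.equal(T1, T2)  # TRUE up to floating point error
```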
These problems are not unique to statistics; they come up across all areas of numerical computing.
Numerical linear algebra \(\ne\) mathematical linear algebra
Efficiency and stability of numerical algorithms matter
Don’t reinvent the wheel - common core linear algebra tools (well defined API)
Low level algorithms for common linear algebra operations
BLAS

Basic Linear Algebra Subprograms
Copying, scaling, multiplying vectors and matrices
Origins go back to 1979, written in Fortran
LAPACK

Linear Algebra Package
Higher level functionality building on BLAS.
Linear solvers, eigenvalues, and matrix decompositions
Origins go back to 1992, mostly Fortran (expanded on LINPACK, EISPACK)
Most default BLAS and LAPACK implementations (like R’s defaults) are somewhat dated
Written in Fortran and designed for a single CPU core
Certain (potentially non-optimal) hard-coded defaults (e.g. block size)
Multithreaded alternatives:
ATLAS - Automatically Tuned Linear Algebra Software
OpenBLAS - fork of GotoBLAS from TACC at UTexas
Intel MKL - Math Kernel Library, part of Intel’s commercial compiler tools
cuBLAS / Magma - GPU libraries from Nvidia and UTK respectively
Accelerate / vecLib - Apple’s framework for GPU and multicore computing
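To check which BLAS/LAPACK a given R installation is linked against (the sessionInfo() fields shown exist in R >= 3.4):

```r
# Paths of the BLAS / LAPACK shared libraries R is using
sessionInfo()$BLAS
sessionInfo()$LAPACK

# Version string of the linked LAPACK
La_version()
```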
| n | 1 core | 2 cores | 4 cores | 8 cores | 16 cores |
|---|---|---|---|---|---|
| 100 | 0.000 | 0.000 | 0.000 | 0.000 | 0.000 |
| 500 | 0.004 | 0.003 | 0.002 | 0.002 | 0.004 |
| 1000 | 0.028 | 0.016 | 0.010 | 0.007 | 0.009 |
| 2000 | 0.207 | 0.110 | 0.058 | 0.035 | 0.039 |
| 3000 | 0.679 | 0.352 | 0.183 | 0.103 | 0.081 |
| 4000 | 1.587 | 0.816 | 0.418 | 0.227 | 0.145 |
| 5000 | 3.104 | 1.583 | 0.807 | 0.453 | 0.266 |
Sta 323 - Spring 2025